Introduction

According to the Center for Disease Control and Prevention (CDC), cardiovascular or heart diseases account for 1 in every 4 deaths in the United States annually. The estimated costs to the nation’s economy amount to about $363 billion annually, which include hospitalizations, after services health care needs as well as lost productivity due to death and disability (SS Virani, 2021). The CDC also states that given the lifestyle habits of an average American, about half of the population (47%) are reported to have “1 of the 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking”. However, the risk factors for developing heart diseases are vast and diverse and tend to vary between most racial and ethnic groups in the United States.

Therefore, there is an immediate need to detect the major risk factors early on, so that a wellness-oriented healthcare system can be instituted. Ideally, such a system would assess the intensity of the risk factors (compared to a baseline) and focus efforts on preventing the occurrence of developing heart diseases in the first place rather than minimizing second occurrences (Liau et al, 2010; Giampaoli et al, 2005 ).

The beginnings of the current understanding of the interconnected role of risk factors was first discussed in the Framingham Heart Study. Started in 1948, the study’s original goal was “ to identify common factors or characteristics that contribute to cardiovascular disease” (Hong 2018). However, today the FHS has morphed into a multigenerational study gathering genetic information that help analyze family patterns of certain diseases, including cardiovascular ailments (National Heart, Lung and Blood Institute). Jumping off from this, current research in the field has focused on using precision medicine algorithms along with computational development technology to assess a patient’s likelihood of developing heart disease given certain lifestyle factors. Most of the studies have focused largely on Caucasian individuals as a sample and indicating that there is a gap in the usefulness of current risk assessment tools working for individuals of different races.

In this paper, we aim to understand the different risk factors that affect the occurrence of heart diseases among the US population, so that we may be able to see some of the intermingling between different risk factors.

Description of Dataset

The original data set can be retrieved from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) annual health survey. The survey interviews over four hundred thousand individuals in the USA asking questions regarding the risk factors of heart disease. Conducted in 2020 (published on February 15, 2021), the original data set has been cleaned from its original 300 variables to 18 variables by Kamil Pytlak and can be accessed on Kaggle .

# Load the Data 
heart <-read.csv("heart_2020_cleaned.csv")

# Descriptive Statistics  
status (heart)
##                          variable q_zeros p_zeros q_na p_na q_inf p_inf
## HeartDisease         HeartDisease       0   0.000    0    0     0     0
## BMI                           BMI       0   0.000    0    0     0     0
## Smoking                   Smoking       0   0.000    0    0     0     0
## AlcoholDrinking   AlcoholDrinking       0   0.000    0    0     0     0
## Stroke                     Stroke       0   0.000    0    0     0     0
## PhysicalHealth     PhysicalHealth  226589   0.709    0    0     0     0
## MentalHealth         MentalHealth  205401   0.642    0    0     0     0
## DiffWalking           DiffWalking       0   0.000    0    0     0     0
## Sex                           Sex       0   0.000    0    0     0     0
## AgeCategory           AgeCategory       0   0.000    0    0     0     0
## Race                         Race       0   0.000    0    0     0     0
## Diabetic                 Diabetic       0   0.000    0    0     0     0
## PhysicalActivity PhysicalActivity       0   0.000    0    0     0     0
## GenHealth               GenHealth       0   0.000    0    0     0     0
## SleepTime               SleepTime       0   0.000    0    0     0     0
## Asthma                     Asthma       0   0.000    0    0     0     0
## KidneyDisease       KidneyDisease       0   0.000    0    0     0     0
## SkinCancer             SkinCancer       0   0.000    0    0     0     0
##                       type unique
## HeartDisease     character      2
## BMI                numeric   3604
## Smoking          character      2
## AlcoholDrinking  character      2
## Stroke           character      2
## PhysicalHealth     integer     31
## MentalHealth       integer     31
## DiffWalking      character      2
## Sex              character      2
## AgeCategory      character     13
## Race             character      6
## Diabetic         character      4
## PhysicalActivity character      2
## GenHealth        character      5
## SleepTime          integer     24
## Asthma           character      2
## KidneyDisease    character      2
## SkinCancer       character      2

From the table we can see that physical and mental health have a higher percentage of zeros. This means that just over 70% of the people in our data set report being physically sick in a 30 day period while about 60% of the respondents report feeling mentally unwell in that same time period.

We can also see that there are no N/A values in our data set. This is not surprising given that our data was cleaned before being retrieved.

Below are the plots showing the means of our numerical variables. We can see that, on average, people with heart disease have a slightly higher BMI than those who don’t have heart disease. However, the biggest difference is the number of days that a respondent feels physically unwell. Respondents who have heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. Poor mental health in a 30 day period is slightly higher for people with heart disease, than for people without.

# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)

heart %>%
  group_by(HeartDisease) %>%
  summarise_at(vars("BMI", 
                    "PhysicalHealth", 
                    "MentalHealth",
                    "SleepTime"), mean)
## # A tibble: 2 × 5
##   HeartDisease   BMI PhysicalHealth MentalHealth SleepTime
##   <chr>        <dbl>          <dbl>        <dbl>     <dbl>
## 1 No            28.2           2.96         3.83      7.09
## 2 Yes           29.4           7.81         4.64      7.14
# Changing Diabetes bordeline and pregnancy categories to Yes/No

heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"  

# Changing all the categorical variables to numerical dummies 

heart$HeartDisease<-ifelse(heart$HeartDisease=="Yes",1,0)
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)

Gender and Lifestyle Factors

According to Noel Bairey Merz, MD, director of the Barbra Streisand Women’s Heart Center in the Smidt Heart Institute, women experience heart disease more severely than men. Meaning that even though men are more likely to suffer heart attacks, women are more likely to die from a heart attack than a man.

We started off by trying to see which gender tends to have more heart disease based on symptoms reported just before diagnosis. From the table below, we can see that we have a lot more data for people without heart disease than with, this will be addressed later. However, even in the percentage table for those who do have heart disease, we see that there are more men with heart disease (about 5%) compared to women (3%). This is in line with our basic theory that men have more heart disease. However, we are then forced to ask why do men have more heart disease in our sample? Are they doing more or less of a certain activity than their female counterparts? Discussions of the relevant variables are provided below.

Proportion of Heart Disease Among the Genders

contable = table(heart$HeartDisease, heart$Sex)
xkabledply(contable, title="Contingency Table for Heart Disease and Sex")
Contingency Table for Heart Disease and Sex
Female Male
0 156571 135851
1 11234 16139
addmargins(contable)
##      
##       Female   Male    Sum
##   0   156571 135851 292422
##   1    11234  16139  27373
##   Sum 167805 151990 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)

#print ("Contingency Table for Heart Disease and Sex in Percentage")
print(percent, main="Contingency Table for Heart Disease and Sex in Percentage")
##    
##     Female  Male
##   0  48.96 42.48
##   1   3.51  5.05
barplot(with(heart, table(HeartDisease, Sex)), main="Heart Disease Among Male & Female", beside=T, col=3:2)

In our exploratory analysis, we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease.

BMI: Is there a difference between the mean BMI of men and woman in the overall population?

hist(heart$BMI,main="Histogram of BMI in Data Set", col='pink', xlab="BMI Levels")

In the histogram, we can see that our data is approximately normal and given the Central Limit Theorem, if n>30, then we can assume normality for the sake of statistical tests.

After sub-setting the data between male and female, we conducted a Welch’s t-test on the BMIs of both population subsets.

# Subset Male & Female with and without Heart Disease
female <- subset(heart, Sex== "Female")
male  <- subset (heart, Sex == "Male")

# T-test of BMI between Genders 
t.test(female$BMI, male$BMI)
## 
##  Welch Two Sample t-test
## 
## data:  female$BMI and male$BMI
## t = -15, df = 3e+05, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.387 -0.299
## sample estimates:
## mean of x mean of y 
##      28.2      28.5

The p-value is less than the 5% significance level, allowing us to reject the null hypothesis and conclude that the overall average BMI is different between the male and female populations. From the mean values, we see that men have a mean BMI of 28.50 while women have a BMI of 28.16. A difference of 0.34.

Next, we divided our data frame between men and women who do have heart disease. Once again assuming normality due to the large sample size, we conducted another Welch’s T-test to look at the mean BMI difference among the population with heart issues.

BMI: Is there a difference in average BMI between the sexes with heart disease?

# Subsetting genders based on respondents that have heart disease 
female_HD<- subset(female, HeartDisease==1)
male_HD<- subset(male, HeartDisease== 1)

#T-test
t.test(female_HD$BMI, male_HD$BMI)
## 
##  Welch Two Sample t-test
## 
## data:  female_HD$BMI and male_HD$BMI
## t = -0.5, df = 20988, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.208  0.121
## sample estimates:
## mean of x mean of y 
##      29.4      29.4

From the T-test output above, we see that the p-value is 0.60, which is greater than 0.05. This means that we fail to reject the null and conclude that there is no significant difference in the average BMI between men and women who do have heart disease.

Physical Health: How many times in the last 30 days have you felt physically ill?

Earlier, we saw that respondents, who have heart disease, reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. The box plot below shows us that women with heart disease generally report felling unwell more often than men.

ggplot(heart, aes(x = factor(HeartDisease), y = PhysicalHealth))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('How many times in the last 30 days have you felt physically ill?') +
    xlab('Heart Disease') + 
    ylab('Physical Health')+
    scale_fill_brewer(palette = 'Set2', name = 'Gender')

t.test(male_HD$PhysicalHealth, female_HD$PhysicalHealth)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$PhysicalHealth and female_HD$PhysicalHealth
## t = -16, df = 23059, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.53 -1.97
## sample estimates:
## mean of x mean of y 
##      6.88      9.14

The t-test also confirms what the box plot is showing. Since the p-value is less than 5%, that there is a difference in the number of days that men and women feel unwell. On average, women reported feeling sick about 9 days while men reported just over 6 days.

Mental Health: How many times in the last 30 days have you felt mentally ill?

ggplot(heart, aes(x = factor(HeartDisease), y = MentalHealth))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
     ggtitle('How many times in the last 30 days have you felt mentally ill?') +
    xlab('Heart Disease') + 
    ylab('Mental Health')+
    scale_fill_brewer(palette = 'Set3', name = 'Sex')

t.test  (male_HD$MentalHealth,female_HD$MentalHealth)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$MentalHealth and female_HD$MentalHealth
## t = -22, df = 21050, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.79 -2.34
## sample estimates:
## mean of x mean of y 
##      3.59      6.16

The boxplot shows that women tend to report feeling mentally unwell more often than men.Since the p-value is less than 5%, indicating that there is a difference in the number of days that men and women feel mentally unwell. On average, women reported feeling mentally unwell about twice as many days as men. The cyclical nature of mental health and exercise has been covered extensively (Mikkelsen 2017; Raglin 1990); which assert that individuals struggling with mental health issues often are unable to complete physically demanding tasks, further making matters worse. This is because exercise has been shown to produce beneficial hormones that are associated with improvements in mental health.

Sleep Time: On average,how many hours of sleep do you get in a 24-hour period?

ggplot(heart, aes(x = factor(HeartDisease), y = SleepTime))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle(' On average,how many hours of sleep do you get in a 24-hour period?') +
    xlab('Heart Disease') + 
    ylab('Sleep Time')+ 
    scale_fill_brewer(palette = 'Set', name = 'Gender')

t.test(male_HD$SleepTime,female_HD$SleepTime)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$SleepTime and female_HD$SleepTime
## t = 4, df = 22378, p-value = 2e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0515 0.1389
## sample estimates:
## mean of x mean of y 
##      7.18      7.08

Visually it look like men and women get the same amount of sleep. However, the t-test shows that men and women have statistically different sleep times. Men on average get about 35 more minutes of sleep a week than women. This may not sound like much but sleep effects are known to be cumulative. A study conducted by Robbins et al. in 2021 “examined a group of people who got the recommended seven to eight hours of sleep a night to those who got less (even just 30 minutes less), and exposed them to a norovirus”. The study found that curtailing sleep below the recommended seven hours increases the risk for viral infection “nearly four times higher among those who are sleeping poorly or getting insufficient sleep, compared to those who are getting good rest” (Robbins et al, 2021).

This lack of sleep could explain why we see that women in our data on average report feeling sick more often and could also be a contributing factors is why women are more likely to die from heart attacks than men (Noel Bairey Merz).

Smoking:Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

#Smoking 

ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle(' Have you smoked at least 100 cigarettes in your entire life?') +
    xlab('Heart Disease') + 
    ylab('Smoking')+ 
    scale_fill_brewer(palette = 'Set2', name = 'Gender')

t.test  ( male_HD$Smoking,female_HD$Smoking)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$Smoking and female_HD$Smoking
## t = 17, df = 23643, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0941 0.1178
## sample estimates:
## mean of x mean of y 
##     0.629     0.523

The number of men and women smoking in our data set are in equal proportions. However, the t-test shows that men average smoke slightly more than women but this difference does not seem significant.

Drinking: How many drinks have you had this week?

ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
    geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
    ggtitle('How many drinks have you had this week?') +
    xlab('Heart Disease') + 
    ylab('Alcohol Consumption')+ 
    scale_fill_brewer(palette = 'Set4', name = 'Gender')

t.test  ( male_HD$AlcoholDrinking,
        female_HD$AlcoholDrinking)
## 
##  Welch Two Sample t-test
## 
## data:  male_HD$AlcoholDrinking and female_HD$AlcoholDrinking
## t = 3, df = 25196, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00133 0.01083
## sample estimates:
## mean of x mean of y 
##    0.0442    0.0381

Note than in the questionnaire, heavy drinkers are defined as “adult men having more than 14 drinks per week and adult women having more than 7 drinks per week”. Even though, the number of drinkers are in equal proportions for both male and female, the number of drinks per between the populations are different. The threshold for being labelled a heavy drinker for women is half of that for a man.Therefore, the mean difference shown in the t-test might look small but on a per drink basis it is significant. Thus, we can conclude that men drink more on a weekly basis than women.

Age, Race, General Health

HD <- subset(heart, HeartDisease==1)

freq(HD$AgeCategory)

##            var frequency percentage cumulative_perc
## 1  80 or older      5449      19.91            19.9
## 2        70-74      4847      17.71            37.6
## 3        65-69      4101      14.98            52.6
## 4        75-79      4049      14.79            67.4
## 5        60-64      3327      12.15            79.5
## 6        55-59      2202       8.04            87.6
## 7        50-54      1383       5.05            92.6
## 8        45-49       744       2.72            95.3
## 9        40-44       486       1.78            97.1
## 10       35-39       296       1.08            98.2
## 11       30-34       226       0.83            99.0
## 12       25-29       133       0.49            99.5
## 13       18-24       130       0.47           100.0
freq(HD$Race)

##                              var frequency percentage cumulative_perc
## 1                          White     22507      82.22            82.2
## 2                          Black      1729       6.32            88.5
## 3                       Hispanic      1443       5.27            93.8
## 4                          Other       886       3.24            97.0
## 5 American Indian/Alaskan Native       542       1.98            99.0
## 6                          Asian       266       0.97           100.0
freq(HD$GenHealth)

##         var frequency percentage cumulative_perc
## 1      Good      9558      34.92            34.9
## 2      Fair      7084      25.88            60.8
## 3 Very good      5381      19.66            80.5
## 4      Poor      3850      14.06            94.5
## 5 Excellent      1500       5.48           100.0

When looking at the population make-up of respondents who do have heart disease, we see that a higher percentage of respondents are 60 and older, with 82% being of Caucasian descent. At the time of the interview they also said they felt their overall health is good. This may be a limiting factor for future research since, once again, the sample does not have many data points from non-Caucasian individuals. We also see that most of our respondents with heart disease are older above 60 years of age, this is in-line with theory.

Modelling

Even though a logit regression seems more appropriate to estimate whether or not an individual gets heart disease, we have used a linear regression in our exploratory stages to find the best fitting model.

Liner Regression on Sex

reg_1 <- lm(HeartDisease~Sex+BMI, data=heart)
print('Model One BIC')
## [1] "Model One BIC"
BIC(reg_1)
## [1] 90503
summary (reg_1)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2521 -0.1024 -0.0805 -0.0587  0.9677 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.03e-03   2.29e-03     2.2    0.028 *  
## SexMale     3.85e-02   9.87e-04    39.0   <2e-16 ***
## BMI         2.20e-03   7.76e-05    28.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 319792 degrees of freedom
## Multiple R-squared:  0.0074, Adjusted R-squared:  0.00739 
## F-statistic: 1.19e+03 on 2 and 319792 DF,  p-value: <2e-16
reg_2 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + SleepTime, data=heart)
print('Model Two BIC')
## [1] "Model Two BIC"
BIC(reg_2)
## [1] 80963
summary (reg_2)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + 
##     SleepTime, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3585 -0.0918 -0.0726 -0.0425  1.0001 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.18e-02   3.41e-03   -6.40  1.5e-10 ***
## SexMale         4.22e-02   9.78e-04   43.10  < 2e-16 ***
## BMI             1.43e-03   7.70e-05   18.55  < 2e-16 ***
## PhysicalHealth  6.18e-03   6.41e-05   96.37  < 2e-16 ***
## MentalHealth   -4.95e-04   6.44e-05   -7.69  1.5e-14 ***
## SleepTime       3.95e-03   3.41e-04   11.58  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.275 on 319789 degrees of freedom
## Multiple R-squared:  0.0367, Adjusted R-squared:  0.0367 
## F-statistic: 2.44e+03 on 5 and 319789 DF,  p-value: <2e-16
reg_3 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + Smoking + AlcoholDrinking+ SleepTime + AgeCategory + Race, data=heart)
print('Model Three BIC')
## [1] "Model Three BIC"
BIC(reg_3)
## [1] 60289
summary (reg_3)
## 
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + 
##     Smoking + AlcoholDrinking + SleepTime + AgeCategory + Race, 
##     data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4570 -0.1220 -0.0529 -0.0020  1.0648 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -6.16e-02   5.25e-03  -11.74  < 2e-16 ***
## SexMale                 4.88e-02   9.57e-04   50.95  < 2e-16 ***
## BMI                     2.05e-03   7.62e-05   26.92  < 2e-16 ***
## PhysicalHealth          4.61e-03   6.32e-05   72.95  < 2e-16 ***
## MentalHealth            9.31e-04   6.39e-05   14.56  < 2e-16 ***
## Smoking                 3.50e-02   9.92e-04   35.27  < 2e-16 ***
## AlcoholDrinking        -2.25e-02   1.89e-03  -11.93  < 2e-16 ***
## SleepTime              -1.39e-03   3.33e-04   -4.16  3.1e-05 ***
## AgeCategory25-29       -6.29e-03   2.75e-03   -2.29   0.0222 *  
## AgeCategory30-34       -6.45e-03   2.69e-03   -2.40   0.0166 *  
## AgeCategory35-39       -5.83e-03   2.64e-03   -2.21   0.0269 *  
## AgeCategory40-44        1.14e-03   2.63e-03    0.43   0.6643    
## AgeCategory45-49        1.10e-02   2.61e-03    4.20  2.6e-05 ***
## AgeCategory50-54        2.92e-02   2.52e-03   11.57  < 2e-16 ***
## AgeCategory55-59        4.63e-02   2.44e-03   18.96  < 2e-16 ***
## AgeCategory60-64        6.98e-02   2.39e-03   29.17  < 2e-16 ***
## AgeCategory65-69        9.48e-02   2.39e-03   39.60  < 2e-16 ***
## AgeCategory70-74        1.32e-01   2.44e-03   54.00  < 2e-16 ***
## AgeCategory75-79        1.65e-01   2.65e-03   62.18  < 2e-16 ***
## AgeCategory80 or older  2.08e-01   2.58e-03   80.59  < 2e-16 ***
## RaceAsian              -2.45e-02   4.75e-03   -5.15  2.5e-07 ***
## RaceBlack              -2.01e-02   4.09e-03   -4.91  9.1e-07 ***
## RaceHispanic           -1.75e-02   4.03e-03   -4.35  1.4e-05 ***
## RaceOther              -1.30e-02   4.48e-03   -2.90   0.0037 ** 
## RaceWhite              -2.04e-02   3.73e-03   -5.48  4.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.266 on 319770 degrees of freedom
## Multiple R-squared:  0.0977, Adjusted R-squared:  0.0976 
## F-statistic: 1.44e+03 on 24 and 319770 DF,  p-value: <2e-16

With each new addition of a variable, we see that the adjusted R-squared is increasing indicating that the model fit is improving. However, the fit statistic is still quite low. Note that the 40-44 years age category has insignificant p-values. Since model 3 has the highest adjusted r-squared and the lowest BIC values, it will be our preferred model.

VIF Table to select variables

ezids::xkablevif(reg_3, wide=FALSE)
VIFs of the model
V1
AgeCategory25-29 1.03
AgeCategory30-34 1.06
AgeCategory35-39 1.14
AgeCategory40-44 1.17
AgeCategory45-49 1.08
AgeCategory50-54 1.02
AgeCategory55-59 1.04
AgeCategory60-64 1.72
AgeCategory65-69 1.81
AgeCategory70-74 1.89
AgeCategory75-79 1.92
AgeCategory80 or older 1.95
AlcoholDrinking 2.10
BMI 2.28
MentalHealth 2.45
PhysicalHealth 2.47
RaceAsian 2.37
RaceBlack 1.99
RaceHispanic 2.10
RaceOther 2.51
RaceWhite 5.04
SexMale 5.77
SleepTime 3.00
Smoking 11.28

From, the VIF table, the VIF value for smoking is 11.28, this is undesirable and thus will be dropped. Therefore our preferred model is as follows:

HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory

We will train and test the model to see the cost analysis of our predictions.

Logistic Regression

In order to address the unbalanced classification, we subsetted our original data frame to create a new one (called Total) that had equal proportions of the Heart Disease category. The new dataset, called Total, has over 54000 observations that have been split into an 80-20 train and test sets.

Model Training

# Creating a new dataframe (called total) with equal proportions of Heart Disease values 
newdata1<- subset(heart, HeartDisease==1)
newdata1_1<-head(newdata1,27373)
nrow(newdata1_1)
## [1] 27373
newdata0<- subset(heart, HeartDisease==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
## [1] 27373
total <- rbind(newdata1_1, newdata0_1)
#nrow(total)
#summary(total)
#freq(total$HeartDisease)


# Convert to factor variables 
total$HeartDisease<-factor(total$HeartDisease)
total$Sex<-factor(total$Sex)

# Split into 80-20 train-test sets

set.seed(1234) 

sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test  <- total[-sample, ]

# Logit Model Training
model<- glm(HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory, data= train, family='binomial')
#summary(model)

#Chi-squared test for model fit 
anova(model, test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: HeartDisease
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                            43795      60714             
## Sex              1      704     43794      60010  < 2e-16 ***
## BMI              1      510     43793      59499  < 2e-16 ***
## PhysicalHealth   1     2416     43792      57083  < 2e-16 ***
## MentalHealth     1       28     43791      57055  9.4e-08 ***
## AlcoholDrinking  1      124     43790      56931  < 2e-16 ***
## SleepTime        1       10     43789      56921   0.0016 ** 
## AgeCategory     12     9273     43777      47647  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the anova chi-squared results, we can conclude that the logit model (HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory) used is a good fit model for predicting the probability of getting heart disease. All the coefficient p-values in the model are also significant, except for the 25-39 age range, which is significant at the 10% level.

# Exponential Coefficients  & Confidence Interval 
exp_coeff<- exp(cbind(OR = coef(model), confint(model)))
knitr::kable(exp_coeff,caption='Odds Ratio & Confidence Interval of Model')
Odds Ratio & Confidence Interval of Model
OR 2.5 % 97.5 %
(Intercept) 0.018 0.014 0.023
SexMale 2.106 2.014 2.203
BMI 1.041 1.037 1.044
PhysicalHealth 1.042 1.039 1.044
MentalHealth 1.021 1.018 1.024
AlcoholDrinking 0.797 0.722 0.879
SleepTime 0.947 0.933 0.960
AgeCategory25-29 0.916 0.685 1.224
AgeCategory30-34 1.602 1.246 2.068
AgeCategory35-39 1.865 1.465 2.387
AgeCategory40-44 3.091 2.463 3.908
AgeCategory45-49 4.059 3.265 5.089
AgeCategory50-54 7.385 5.990 9.188
AgeCategory55-59 10.309 8.398 12.778
AgeCategory60-64 14.405 11.766 17.810
AgeCategory65-69 18.361 15.012 22.681
AgeCategory70-74 27.403 22.402 33.855
AgeCategory75-79 33.746 27.513 41.797
AgeCategory80 or older 51.239 41.794 63.438

According to the model, the odds for getting heart disease increases by 2.11 for a man compared to a woman. While an increase in BMI increases the odds for heart disease by 1.03. Mental health,physical health, drinking, and sleep time also affect your chances of getting heart disease but are not the most salient risk factors; they could be used as early indicators. As expected, the odds of getting heart disease for age increases as an individual gets older.

Also note, the confidence intervals for all the variables are narrow, indicating that they are good predictors of heart disease. However, the confidence intervals for age gets wider as we increase an individuals age. Although the uncertainty is greater at higher age ranges, there still may be enough precision to inform a patient about preventative measures, when considering their lifestyle choices.

Predicting and Assessing the Model

library(ggthemes)

# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction  <- predict( model, newdata = test , type = "response" )

# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) + 
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) + 
theme_economist()

# Positive instance = 1 = Yes Heart Disease 
# Negative instance = 0 = No Heart Disease

After obtaining the predicted values that an individual will have heart disease on both the training and testing sets, a double density plot was drawn to show those predictions. Since we split our data evenly, the negative instances are skewed to the left while the positive instances are skewed to the right. This is an ideal density plot.

ROC-AUC

# user-defined different cost for false negative and false positive

source('unbalanced_functions.R')

cost.fp = 100
cost.fn = 200

roc_info <- ROCInfo( data = test, 
                     predict = "prediction", 
                               actual = "HeartDisease", 
                               cost.fp = 100,
                               cost.fn = 200 )
grid.draw(roc_info$plot)

From the plots, we can see that our Area Under the Curve is about 79%, which is an acceptable model for our purposes. Specifying our different costs for false positives ($100) and false negatives ($200), we see that our total loss cost is just under $382000.

Confusion Matrix

loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion Matrix from Logit Model" )
Confusion Matrix from Logit Model
Predicted 0 Predicted 1 Total
Actual 0 14653 7276 21929
Actual 1 4691 17176 21867
Total 19344 24452 43796
#unloadPkg("regclass")

# 
cm_info <- ConfusionMatrixInfo( data = test, 
                                predict = "prediction", 
                                actual  = 'HeartDisease', 
                                cutoff  = .30 )
#ggthemr("flat")
plot(cm_info$plot)

Calculated Values for Model:

Accuracy = 0.727 Mis-classification = 0.27 Sensitivity = 0.78 Specificity = 0.66

Using the optimal cut-off value of 0.30 obtained from the ROC-AUC Loss function, we see that our model a is able to calculate more positive values than negative values. The accuracy of the model is about 73%, which is not a good model in the healthcare field but can still be used as a public early detection tool to bring about individual awareness. We see that our model shows high sensitivity and low specificity, this means that our model is great at estimating actual cases of whether an individual will have heart disease but tends to have a high false positive rate. In the grand scheme life, higher false poistives is not as bad as having a false negative rate. For instance, an individual A uses our model and is told that they have a higher likelihood of having heart issues in the future. However, turns out this is a false positive and A just ends up living a bit more healthier than they did before. However, it is the false negatives that we will need to adjust for in the future.